Abstract. - In a recent paper, Krapivsky and Redner [1] proposed a new growing network 
model in which each new node is attached to a randomly selected node, as well as to all ancestors 
of the target node. The model leads to a sparse graph with an average degree growing 
logarithmically with the system size. Here we present compelling evidence that software networks 
result from a similar class of growing dynamics. The predicted pattern of network growth, as well 
as the stationary in- and out-degree distributions, are consistent with the model. Our results 
confirm the view of large-scale software topology being generated through duplication-rewiring 
mechanisms. Implications of these findings are outlined. 



Introduction. - The structure of many natural and artificial systems can be depicted 
with networks. Empirical studies on these networks have revealed that many of them display 
a heterogeneous degree distribution $P(k) \sim k^{-\gamma}$, where a few nodes (hubs) have a large 
number of connections while the majority of nodes have one or two links [2]. The existence of hubs 
has been related to multiplicative effects affecting network evolution [3]. Such topological 
patterns have been explained by a number of mechanisms, including preferential attachment 
rules [4] and network models based on simple rules of node duplication [5]. A very simple 
approach is given by the growing network model with copying (GNC) [1]. The network grows 
by introducing a single node at a time. This new node links to m randomly selected target 
node(s) with probability p, as well as to all ancestor nodes of each target with probability q (see 
fig. 1). The discrete dynamics follows a rate equation [1] 



$$L(N+1) = L(N) + \frac{m}{N}\sum_{\mu}\left(p + q\,j_\mu\right) \qquad (1)$$




where L and N are the number of links and nodes, respectively. The second term in the 
right-hand side describes the copying process, where the average number of links added is 
given by $p + q\,j_\mu$, with $j_\mu$ the number of ancestors of node $\mu$. The $\mu$ index refers 
to the node $\mu$, to be selected uniformly from among the N elements. Assuming a continuum 
approximation, the number of links is driven by the following differential equation: 
$$\frac{dL}{dN} = mp + mq\,\frac{L}{N} \qquad (2)$$

Fig. 1 - (A) Illustration of the copying rule used in the network growth model. Each node is labeled 
with a number indicating its age (number one is the oldest). In the figure, new node vq attaches to 
target node Vi with probability p. This new node inherits every link from the target node (dashed 
links) with probability q. (B) Synthetic network obtained with the GNC model with N = 100, m = 1, 
p = 1 and q = 1. (C) Synthetic network obtained with the GNC model with N = 100, m = 4, p = 0.25 
and q = 0.25. These networks have a scale-free in-degree distribution and an exponential out-degree 
distribution. 

The asymptotic growth of the average total number of links depends on the extent of 
copying defined by the product mq. In particular, logarithmic growth is recovered when mq = 
1, where $L(N) = mp\,N \log N$. This corresponds to a marginal situation separating a domain of 
linear growth (mq < 1) from a domain of faster-than-linear growth, $L \sim N^{mq}$ (1 < mq < 2), 
as follows from solving eq. (2). Interestingly, 
for mq = 1 the GNC model predicts a power-law in-degree distribution $P_i(k) \sim k^{-\gamma_i}$ with 
exponent $\gamma_i = 2$ and an exponential out-degree distribution $P_o(k)$, independently of copying 
parameters. Actually, their derivation for the in-degree distribution can be generalized for 
arbitrary q and p values, leading to a scaling law $P_i(k) \sim k^{-2}$ for the parameter domain of 
interest. In ref. [1] the authors showed that the GNC model seems to consistently explain the 
patterns displayed by citation networks. Here, we show that a GNC model is also consistent 
with the evolution of software designs, which also display the predicted logarithmic growth. 
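
The growth rule described above can be illustrated with a short simulation. This is only a minimal sketch under stated assumptions: "ancestors" are taken to be the nodes the target links to directly, duplicate links created by overlapping targets are merged, and the function name `simulate_gnc` and its `seed` argument are illustrative choices that do not come from ref. [1].

```python
import random


def simulate_gnc(n_nodes, m=1, p=1.0, q=1.0, seed=0):
    """Grow a GNC-style directed graph one node at a time.

    Returns (out_links, link_counts), where out_links[i] is the set of
    older nodes that node i points to and link_counts[i] is the total
    number of links L once the graph holds i + 1 nodes.
    """
    rng = random.Random(seed)
    out_links = [set()]   # the first node has nothing to link to
    link_counts = [0]
    total_links = 0
    for new_node in range(1, n_nodes):
        links = set()
        for _ in range(m):
            target = rng.randrange(new_node)   # uniform choice among existing nodes
            if rng.random() < p:               # link to the target itself
                links.add(target)
            for ancestor in out_links[target]:
                if rng.random() < q:           # copy each link of the target
                    links.add(ancestor)
        out_links.append(links)
        total_links += len(links)
        link_counts.append(total_links)
    return out_links, link_counts


if __name__ == "__main__":
    # Assumed illustration: the marginal case m = p = q = 1.
    _, counts = simulate_gnc(2000, m=1, p=1.0, q=1.0, seed=42)
    for size in (10, 100, 1000, 2000):
        print(size, counts[size - 1] / size)   # average degree L/N at that size
```

With q = 0 and m = p = 1 the sketch reduces to a random recursive tree (exactly N - 1 links), which gives a simple sanity check on the copying step.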

Software Networks. - One of the most important technological networks, together with 
the Internet and the power grid, is represented by a broad class of software maps. Software 
actually pervades technological complexity, since the control and stability of transport systems, 
the Internet and the power grid largely rely on software-based systems. In spite of the multiplicity 
and diversity of objectives and functionalities addressed by software projects, we have pointed 
out the existence of strong universals in their internal organization [7]. Computer programs 
are often decomposed into a collection of text files written in a given programming language. In 
this paper, we will study computer programs written in C or C++ [6]. Program structure can 
be recovered from the analysis of program files by means of a directed network representation. 
In a software network, software entities (files, classes, functions or instructions) map onto 
nodes, and links represent syntactical dependencies [7]. Class graphs (also called 'logical 
dependency graph' [8]) are a particular class of software networks that have been shown to 
be small-world and scale-free networks with an exponent $\gamma \approx 2.5$ [7, 9, 10]. Interestingly, 
the frequencies of common motifs displayed in class graphs can be predicted by a very simple 
duplication-based model of network growth [11]. This result indicates that the topology of 
technological designs, in spite of being planned and problem-dependent, might actually emerge 
from common, distributed rules of tinkering [12]. In the following, we provide further evidence 
for the importance of duplication processes in the evolution of software networks. 

Fig. 2 - (A) Largest connected component of the XFree86 include network at 15/05/1994 (with 
N = 393) displays scale-free behavior (see text). In (B), the cumulative distributions $P_{i>}(k)$ and 
$P_{o>}(k)$ are shown for a more recent version of the XFree86 include network with N = 1299 (not 
shown here). The power-law fit of the in-degree distribution yields $P_{i>}(k) \sim k^{-\gamma_i + 1}$ with 
a fitted cumulative exponent of 0.97 ± 0.01, while the out-degree distribution is exponential. In (C) 
we can notice similar features for the in-degree and out-degree distributions of the Aztec include 
network at 29/3/2003. For this system, the power-law fit of the in-degree distribution yields an 
exponent of 1.22 ± 0.03. 

Here, we study a new class of software networks. We use the so-called 'include graph' (or 
'physical dependency graph' in [8]) $G = (V, E)$ where $v_i \in V$ is a program file and a directed 
link $(v_i, v_j) \in E$ indicates a (compile-time) dependency between file $v_i$ and $v_j$. In C and 
C++, such dependencies are encoded with the preprocessor directive "#include" followed by the 
name of the referred source file [8]. In order to recover the include graph, we have implemented a 
network reconstruction algorithm that analyses the contents of all files in the software project 
looking for this directive. Every time the directive is found in a file $v_i$, the name 
of the referred file $v_j$ is decoded and a new link $(v_i, v_j)$ is added. No other information 
is considered by the network reconstruction algorithm. Notice that the include network is 
unweighted because it makes no difference to include the same file twice. 
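
A reconstruction step of this kind might look as follows. This is a hedged sketch, not the algorithm actually used in this study: the names `parse_includes` and `build_include_graph` are hypothetical, files are matched by base name only, and links to files outside the project (for instance system headers) are dropped, all of which are assumptions.

```python
import re

# Matches lines such as:  #include "foo.h"   or   # include <bar/baz.h>
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)


def parse_includes(source_text):
    """Return the names referred to by every #include directive in the text."""
    return INCLUDE_RE.findall(source_text)


def build_include_graph(files):
    """Build the include graph of a project.

    `files` maps a file's base name to its contents. The result is a set of
    directed links (v_i, v_j), meaning that file v_i includes file v_j.
    Using a set keeps the network unweighted: repeated includes of the same
    file collapse into a single link.
    """
    edges = set()
    for name, text in files.items():
        for referred in parse_includes(text):
            target = referred.split("/")[-1]   # assumption: match by base name
            if target in files:                # assumption: ignore external files
                edges.add((name, target))
    return edges


if __name__ == "__main__":
    project = {
        "main.c": '#include <stdio.h>\n#include "util.h"\n#include "util.h"\n',
        "util.h": '#include "types.h"\n',
        "types.h": "",
    }
    print(sorted(build_include_graph(project)))
```

The number of keys in `files` then plays the role of N and the size of the returned set plays the role of L.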

In this paper, we investigate the structure and evolution of software maps by looking 
at their topological structure and the time series of aggregate topological measures, such as 
number of nodes N(t), number of links L(t) or average degree k(t) = L(t)/N(t). It is worth 
mentioning that the number of nodes in an include graph coincides with the number of files in 
the software project, which is often used as a measure of project size. 

Software maps typically display asymmetries in their in- and out-degree distributions [9, 
10], although the origins of such asymmetry have remained unclear. Notice how the out-degree 
and in-degree distributions of real include networks are quite similar to the corresponding 
distributions obtained with the GNC model (see previous section). The in-degree and out-degree 
distributions for the largest component of two different systems are shown in fig. 2B and 
fig. 2C, where we have used the cumulative distributions $P_>(k) = \int_k^\infty P(k')\,dk'$. In both 
cases, in-degree distributions display scaling $P_i(k) \sim k^{-\gamma_i}$, where the estimated exponent 
is consistent with the prediction from the GNC model (the average value for the systems 
analysed is $\langle \gamma_i \rangle = 2.08 \pm 0.04$ [13]), whereas out-degree distributions are 
single-scaled. As shown in the next section, these stationary distributions result from a 
logarithmic growth dynamics consistent with the GNC model. 
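
For a finite network the cumulative distribution is a sum rather than an integral. The sketch below shows one way to compute it from a set of directed links; the helper names `degree_sequences` and `cumulative_distribution` are illustrative, and nodes without any link are not counted, which is an assumption of this example.

```python
from collections import Counter


def degree_sequences(edges):
    """Return (in_degrees, out_degrees) over all nodes appearing in `edges`.

    A link (source, target) adds one to the out-degree of `source` and one
    to the in-degree of `target`. Nodes are ordered by sorted name.
    """
    in_deg, out_deg = Counter(), Counter()
    nodes = set()
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
        nodes.update((source, target))
    ordered = sorted(nodes)
    return [in_deg[v] for v in ordered], [out_deg[v] for v in ordered]


def cumulative_distribution(degrees):
    """Return {k: fraction of nodes with degree >= k} for each observed k."""
    total = len(degrees)
    counts = Counter(degrees)
    remaining = total
    result = {}
    for k in sorted(counts):
        result[k] = remaining / total
        remaining -= counts[k]
    return result
```

Plotting the returned values against k on double logarithmic axes gives a straight line for a power law, whereas an exponential distribution is straight on semi-logarithmic axes.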

Software Evolution. - Although an extensive literature on software evolution exists (see 
for example [14, 15]), few quantitative predictions have been presented so far. Most studies 
are actually descriptive, and untested conjectures about the nature of the constraints acting 
on the growth of software structures abound. It is not clear, for example, if the large-scale 
patterns are due to external constraints, path-dependent processes or specific functionalities. 
In order to answer these questions, we have compared real software evolution with models of 
network growth, where software size is measured as the number of nodes in the corresponding 
include graph. In this context, the assumptions of the GNC model are consistent with 
observations claiming that code cloning is a common practice in software development [15]. 
Indeed, comparison between real include graphs and those generated with the GNC model 
suggests that the extent of copying performed during software evolution is a key parameter that 
explains the overall pattern of software growth. Such a situation has also been found in class 
diagrams [11]. 

The growth dynamics found in include graphs is logarithmic (see fig. 3), thus indicating 
that we are close to the mq = 1 regime. Indeed, the sparseness seen in software maps is likely 
to result from a compromise between having enough dependencies to provide diversity and 
complexity (which require more links) and evolvability and flexibility (requiring fewer 
connections). Here we have uneven, but detailed information on the process of software building. 
In this context, different software development projects display specific patterns of growth. 
Specifically, the number of nodes N grows with time following a case-dependent functional 
form $N = \phi(t)$. Using $dL/dt = (dL/dN)(d\phi/dt)$, we have, from eq. (2), 



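$$\frac{dL}{dt} = \left[ mp + mq\,\frac{L}{\phi(t)} \right] \frac{d\phi}{dt} \qquad (3)$$

with a general solution

$$L(t) = \phi^{mq} \left[ mp \int \phi^{-mq}\,\frac{d\phi}{dt}\,dt + \Gamma \right] \qquad (4)$$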



where $\Gamma$ is a constant. Using a linear growth law (which is not uncommon in software 
development), i.e. $N(t) = N_0 + at$, and assuming mq = 1, we have 



$$L(t) = (N_0 + at)\left[ mp \log\left(\frac{N_0 + at}{N_0}\right) + \frac{L_0}{N_0} \right] \qquad (5)$$
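
As a consistency check, the closed form of eq. (5) can be compared with a direct numerical integration of eq. (2) for mq = 1. The sketch below is illustrative only: the function names, the midpoint integration scheme and the parameter values in the usage note are assumptions, and $L_0$ is read as the number of links at t = 0.

```python
import math


def links_closed_form(t, n0, a, mp, l0):
    """L(t) for mq = 1 and N(t) = n0 + a*t, with L(0) = l0 (assumed form of eq. (5))."""
    n = n0 + a * t
    return n * (mp * math.log(n / n0) + l0 / n0)


def links_numerical(t, n0, a, mp, l0, steps=10000):
    """Integrate dL/dN = mp + L/N from N = n0 to N = n0 + a*t (midpoint method)."""
    n_end = n0 + a * t
    h = (n_end - n0) / steps
    n, links = n0, l0
    for _ in range(steps):
        slope_start = mp + links / n
        # Evaluate the slope at the midpoint of the step for second-order accuracy.
        slope_mid = mp + (links + 0.5 * h * slope_start) / (n + 0.5 * h)
        links += h * slope_mid
        n += h
    return links
```

For example, `links_closed_form(50000, 600.0, 0.01, 2.0, 1400.0)` and `links_numerical(50000, 600.0, 0.01, 2.0, 1400.0)` should agree closely, since both describe the same solution of eq. (2).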



However, typical time series of L(t) in real software evolution are subject to fluctuations 
(see fig. 3A). In order to reduce the impact of fluctuations we use the cumulative average 
degree $K(t) = \int_0^t (L/N)\,dt$ instead. Assuming the number of nodes grows linearly in time, 
we obtain: 
Table I - Predictions of eq. (6) for different systems.

Project        a                   N_0                mp             L_0                T
XFree86        0.0086 ± 0.0001     622.17 ± 10.92     2.20 ± 0.01    1419.80 ± 4.09     243
Postgresql     0.0066 ± 0.0002     601.42 ± 11.35     1.78 ± 0.05     243.89 ± 8.46      31
DCPlusPlus     0.004  ± 0.0001     101.51 ± 2.42      0.70 ± 0.03     338.96 ± 1.30      74
TortoiseCVS    0.0057 ± 0.0001      97.57 ± 2.62      1.59 ± 0.02     105.76 ± 1.58     107
Aztec          0.026  ± 0.002      205.12 ± 22.17     0.97 ± 0.03     622.61 ± 4.77      14
Emule          0.016  ± 0.0006      98.01 ± 6.37      1.65 ± 0.11     223.80 ± 9.34      54
VirtualDub     0.0079 ± 0.0004     167.04 ± 12.44     1.34 ± 0.05     381.50 ± 5.16      35


$$K(t) = \frac{mp\,(N_0 + at)}{a}\,\log\left(\frac{N_0 + at}{N_0}\right) + \left(\frac{L_0}{N_0} - mp\right) t \qquad (6)$$



The above expressions can be employed to estimate the parameters $L_0$ and mp describing 
the shape of the logarithmic growth of the number of links L(t), and the parameters $N_0$ and a 
controlling the linear growth of the number of nodes N(t). We used the following fitting procedure. 
For each software project, we have recovered a temporal sequence $\{G_t = (V_t, E_t) \mid 0 \le t \le T\}$ 
of include networks corresponding to different versions of the software project. Time is 
measured in elapsed hours since the first observed project version (which may or may not coincide 
with the beginning of the project). This temporal sequence describes the evolution of the 
software project under study. From this sequence, we compute the evolution of the number of 
nodes $n_0, n_1, n_2, \ldots, n_T$, the evolution of the number of links $l_0, l_1, l_2, \ldots, l_T$ and 
the evolution of the average degree $k_0, k_1, \ldots, k_i = l_i/n_i, \ldots, k_T$. In general, available 
data is a partial set of records of development histories and often misses the initial project 
versions corresponding to the early evolution. The first observed version then does not correspond 
to the start of the project, and this explains why the initial observations for $n_0$ and $l_0$ 
are higher than expected. However, we have rescaled time so the first datapoint corresponds to 
zero. We have collected partial (1) evolution registers for seven different projects (relevant time 
period is in parentheses): XFree86 (16/5/94 - 1/6/05), Postgresql (1/1/95 - 1/12/04), 
DCPlusPlus (1/12/01 - 15/12/04), TortoiseCVS (15/1/01 - 1/6/05), Aztec (22/3/01 - 14/4/03), 
Emule (6/7/02 - 26/7/05) and VirtualDub (15/8/00 - 10/7/05) [13]. The full database 
comprises 557 include networks (see table I). 

Then, we proceed as follows. First, for each software project, its time series for the 
number of nodes is fitted under the assumption of linear growth, i.e. $N(t) = N_0 + at$, 
thus yielding $N_0$ (initial number of nodes) and a (rate of addition of new files). In table I, we 
can appreciate that the majority of projects grow at a rate a of the order of $10^{-3}$ files/hour, 
while two medium-size projects (Aztec and Emule) actually grow an order of magnitude 
faster. Next, we compute the time series of cumulative average degree K(t) by integrating 
numerically the sequence of $k_t$ values. This new sequence is then fitted with eq. (6) in order 
to estimate the parameters $L_0$ (initial number of links) and the product mp controlling the 
extent of duplication. 
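
The two-stage procedure just described can be sketched with NumPy as follows. This is an illustrative reading of the procedure, not the original fitting code: the function names are hypothetical, the trapezoidal rule is an assumed choice for the numerical integration, and K(t) is taken in the form obtained by integrating L/N from eq. (5) between 0 and t, which is linear in mp and $L_0$ once $N_0$ and a are fixed.

```python
import numpy as np


def fit_linear_growth(t, n):
    """Least-squares fit of N(t) = n0 + a*t; returns (n0, a)."""
    a, n0 = np.polyfit(t, n, 1)   # polyfit returns the highest-degree coefficient first
    return n0, a


def cumulative_average_degree(t, k):
    """Integrate k(t) = L/N numerically (trapezoidal rule), starting from K(0) = 0."""
    increments = 0.5 * (k[1:] + k[:-1]) * np.diff(t)
    return np.concatenate(([0.0], np.cumsum(increments)))


def fit_cumulative_degree(t, big_k, n0, a):
    """Fit K(t) = mp*[(N/a) log(N/n0) - t] + l0*(t/n0), with N = n0 + a*t.

    Because the expression is linear in mp and l0, an ordinary linear
    least-squares solve is sufficient. Returns (mp, l0).
    """
    n = n0 + a * t
    basis_mp = (n / a) * np.log(n / n0) - t
    basis_l0 = t / n0
    design = np.column_stack((basis_mp, basis_l0))
    (mp, l0), *_ = np.linalg.lstsq(design, big_k, rcond=None)
    return mp, l0
```

A typical call sequence would be `n0, a = fit_linear_growth(t, n)`, then `big_k = cumulative_average_degree(t, l / n)`, then `mp, l0 = fit_cumulative_degree(t, big_k, n0, a)`, where `t`, `n` and `l` are arrays holding the observation times and the node and link counts of each version.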

(1) Actually, these datasets constitute a coarse sampling of the underlying process of software change. 
Collecting software evolution data at the finest level of resolution requires a monitoring system that 
automatically tracks all changes made by programmers. Instead, it is often the programmer who decides 
when a software register is created. The issue of fine-grained sampling is an open research question in 
empirical software engineering that deserves more attention. These limitations preclude a more direct 
test of the GNC model. 
Fig. 3 - (A) The top curve shows the comparison between the time evolution of the number of links 
L(t) in XFree86 between 16/05/1994 and 01/06/2005 (points) and the prediction of eq. (5) (dashed 
line). In the bottom curve we compare the time evolution of system size N(t) and its linear fitting 
$N(t) = N_0 + at$ (dashed line). We observe an anomalous growth pattern followed by a discontinuity 
(here indicated as $t_1$ and $t_2$) in L(t). Notice how $t_2$ signals a discontinuity both in L(t) and 
N(t), while discontinuity $t_1$ only takes place in L(t). (B) Comparison between time evolution of the 
cumulative average degree in XFree86 during the same time period as in (A) and the analytic 
prediction of eq. (6). (C) The inset shows the same data as in (B) but in a double logarithmic plot. 
The fitting parameters are: $N_0$ = 622.17 ± 10.92, a = 0.0086 ± 0.0002, $L_0$ = 1419.8 ± 4.1, and 
mp = 2.20 ± 0.01. Time is measured in hours. 



In fig. 3B we show the result of the previous fitting procedure applied to the time series of 
cumulative average degree K(t) in XFree86, a popular and freely redistributable open source 
implementation of the X Window System [13]. As shown in the figure, the agreement between 
theory and data is very good. We have validated the same logarithmic growth pattern in the 
evolution of other software systems (see table I). In particular, we provide a prediction for 
the average number of links to target nodes, mp, which is found to be small. This is again 
expected from the sparse graphs that are generated through the growth process. 

Together with the overall trends, we also see deviations from the logarithmic growth 
followed by reset events. In fig. 3A we can appreciate a pattern of discontinuous software growth 
in the number of links L(t) for XFree86. The time interval delimited by $t_1$ and $t_2$ is the 
signature of a well-known major redesign process that enabled 3D rendering capabilities in 
XFree86. This new feature of XFree86 was called Direct Rendering Infrastructure (DRI). 
Development of DRI is clearly visible in the time series of L(t). At $t_1$ (i.e., August of 1998) the 
design of DRI was officially initiated and the event $t_2$ (i.e., July of 1999) corresponds to the 
first public release of the DRI technology (i.e., DRI 1.0) [16]. A careful look at the time series 
L(t) shows that before the discontinuities (indicated by $t_1$ and $t_2$), some type of precursor 
patterns were detectable. 

The above example suggests how deviations from the logarithmic growth pattern can 
predict future episodes of costly internal reorganization (so-called refactorings [17]). In XFree86, 
the integration of DRI was a costly redesign process characterized by an exponential growth 
pattern in the number of links L(t). This accelerated growth pattern starts at $t_1$ and finishes 
in a clearly visible discontinuity (indicated here by $t_2$) that signals a heavy removal of links. 
After $t_2$ we observe a pattern of fast recovery eventually returning to the logarithmic trend 
described by eq. (5) (dashed lines in fig. 3A). This type of reset pattern has also been found 
in economic fluctuations in the stock market [18]. This trend needs to be explained and might 
actually result from conflicting constraints leading to some class of marginal equilibrium state. 
This is actually in agreement with the patterns of activity change displayed by the community 
of software developers (unpublished results), which also exhibit scale-free fluctuations. 

* * * 